[not merge] xpu glm test #7748
Conversation
Thanks for your contribution!
The CI report is generated from the code below (updated every 30 minutes):
1 Task overview: all Required tasks passed (this PR has no Required tasks); 1 optional task failed (does not block merging).
2 Task status summary
2.1 Required tasks: 0/0 passed
2.2 Optional tasks: 1/2 passed
3 Failure details (required only): no required task failures.
696c9a5 to 240c808
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-14 00:44:40
📋 Review Summary
PR overview: fixes XPU (Kunlunxin) GLM inference issues, hardens KVCache write/prefetch robustness, fixes a local_scheduler bug, and adds support for the splitwise interrupt command
Scope of changes: cache_manager/, scheduler/local_scheduler.py, model_executor/layers/sample/sampler.py, worker/xpu_model_runner.py, splitwise/
Impact tags: [KVCache] [Scheduler] [XPU] [PD Disaggregation] [OP]
📝 PR Convention Check
The title [not merge] xpu glm test contains no official tag, and every section of the PR description is an empty placeholder, so the PR does not meet the guidelines.
Suggested title (ready to copy):
[BugFix][XPU] Fix XPU sampling params, local scheduler recycle bug, and KVCache storage robustness
Suggested PR description (ready to copy; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Fixes several issues seen when running GLM model inference on XPU (Kunlunxin), including out-of-range sampling parameters, IndexError/cursor bugs when the local scheduler recycles requests, and robustness of KVCache storage write timeouts and prefetch failures; also adds support for the splitwise `interrupt_requests` control command.
## Modifications
- `fastdeploy/model_executor/layers/sample/sampler.py`: use a 32-bit MAX_INFER_SEED (2147483646) on the XPU platform; change the decoder offset multiplier to 32
- `fastdeploy/scheduler/local_scheduler.py`: fix bugs in `_recycle` where `ids.index` could raise ValueError and the cursor was decremented unconditionally; fix batch removal of expired IDs using the wrong index
- `fastdeploy/cache_manager/prefix_cache_manager.py`: when GPU blocks are insufficient, log a warning and skip the storage prefetch; wrap the prefetch path in try/except; truncate token_ids to the actual block size on storage writes
- `fastdeploy/cache_manager/cache_transfer_manager.py`: move the `flush_token_index` call from the `write_back_storage_task` finally block to the start of `_run_write_back_storage`
- `fastdeploy/cache_manager/transfer_factory/mooncake_store/attention_store.py`: switch batch writes to slice-by-slice writes, with both a total timeout and a per-slice timeout
- `fastdeploy/worker/xpu_model_runner.py`: return None early when `ids_remove_padding` is empty
- `fastdeploy/splitwise/internal_adapter_utils.py`: add handling for the `interrupt_requests` control command
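The slice-by-slice write with dual timeouts described above can be sketched as follows. This is a hedged illustration, not the actual attention_store.py code: `write_slice`, `slice_size`, and the timeout parameters are hypothetical stand-ins for the real Mooncake store API.

```python
import time

# Hypothetical sketch: write `data` in slices, enforcing both a total deadline
# and a per-slice timeout. Names are assumptions, not the FastDeploy API.
def write_in_slices(write_slice, data, slice_size, total_timeout, slice_timeout):
    deadline = time.monotonic() + total_timeout
    for start in range(0, len(data), slice_size):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # total write budget exhausted
        # Each slice waits at most its own timeout, capped by the remaining budget.
        ok = write_slice(data[start:start + slice_size],
                         timeout=min(slice_timeout, remaining))
        if not ok:
            return False
    return True
```

Splitting one large write into bounded slices keeps a single slow transfer from stalling the whole write-back path, which matches the robustness goal stated in the Motivation.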
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [ ] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [ ] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🔴 Bug | fastdeploy/model_executor/layers/sample/sampler.py:103 | `local_pos * 32` has no XPU platform guard, so it changes sampling behavior on all hardware |
| 🔴 Bug | fastdeploy/cache_manager/prefix_cache_manager.py:1143 | The `enable_output_caching` check was removed, so output tokens are written to the storage cache unconditionally |
| 🟡 Suggestion | fastdeploy/scheduler/local_scheduler.py:157 | A4 multi-implementation sync: has the `_recycle` bug fix been propagated to the global/dp/splitwise schedulers? |
| ❓ Question | fastdeploy/cache_manager/cache_transfer_manager.py:934 | Re-checking `storage_backend_type == "attention_store"` inside the `elif attention_store` branch is redundant |
Overall Assessment
The PR contains important fixes across several modules and is headed in the right direction. However, the two P0 issues (the unguarded XPU sampling offset affecting all platforms, and the breaking change to output caching behavior) must be resolved before merging. Given the [not merge] title, we suggest addressing the issues above and then opening a formal PR.
```diff
 offsets = paddle.where(
     is_decoder,
-    local_pos * 4,
+    local_pos * 32,
```
🔴 Bug `local_pos * 32` has no XPU platform guard and takes effect on all hardware.
In the same function, the MAX_INFER_SEED change is correctly guarded with `if current_platform.is_xpu()`, but this offset multiplier change from 4 to 32 has no platform check and will alter sampling randomness on GPU / DCU and other hardware.
Suggested fix:

```python
if current_platform.is_xpu():
    offsets = paddle.where(
        is_decoder,
        local_pos * 32,
        paddle.zeros_like(local_pos),
    )
else:
    offsets = paddle.where(
        is_decoder,
        local_pos * 4,
        paddle.zeros_like(local_pos),
    )
```

If 32 is the correct value on every platform (and 4 was a historical bug), please explain why in the PR description.
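For contrast, the MAX_INFER_SEED change that the review cites as correctly guarded can be sketched like this. `is_xpu` stands in for `current_platform.is_xpu()`, the 32-bit value matches the 2147483646 from the Modifications, and the 64-bit fallback is an assumption:

```python
# Hedged sketch: pick a seed upper bound per platform. Only the XPU value
# (2147483646) comes from the PR; the non-XPU bound is assumed.
def max_infer_seed(is_xpu: bool) -> int:
    if is_xpu:
        return 2**31 - 2  # 2147483646, fits 32-bit XPU seed kernels
    return 2**63 - 2
```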
```diff
-    input_token_ids = token_ids + request.output_token_ids
-else:
-    input_token_ids = token_ids
+input_token_ids = token_ids + request.output_token_ids
```
🔴 Bug After removing the `enable_output_caching` check, output tokens are written to the storage cache unconditionally, breaking the expectations of users who do not enable output caching.
Original semantics: output token ids are appended to the storage-key computation only when `enable_output_caching=True`; when it is False, only input tokens form the storage key. With the check removed, every user writes output blocks to storage, which can cause:
- Prefetch (which only has input tokens) can never hit these storage blocks carrying output suffixes, making the writes useless
- Behavior breakage for scenarios that must not cache output (security/privacy requirements)
Restore the original conditional, or explicitly state the intent of this change in the PR.
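The conditional the review asks to restore can be sketched as below; `build_storage_key_tokens` is a hypothetical helper for illustration, not the actual prefix_cache_manager.py code:

```python
# Hedged sketch of the original behavior: output tokens join the storage-key
# token list only when output caching is explicitly enabled.
def build_storage_key_tokens(token_ids, output_token_ids, enable_output_caching):
    if enable_output_caching:
        return token_ids + output_token_ids
    return token_ids
```

With the flag off, the stored key is derived from input tokens alone, so a later prefetch that only knows the input tokens can still match it.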
```diff
         expired_ids.append(request.request_id)

-for i, expired_id in enumerate(expired_ids):
+for expired_id in expired_ids:
```
🟡 Suggestion A4 multi-implementation sync check: `_recycle` fixes the following bugs:
- `ids.index(request_id)` could raise `ValueError` (now caught)
- `ids_read_cursor` was decremented unconditionally (now guarded by `if idx < self.ids_read_cursor`)
- `self.ids.pop(i)` used the enumeration index instead of the actual position (now replaced with `self.ids = self.ids[len(expired_ids):]`)

Please check whether global_scheduler.py / dp_scheduler.py / splitwise_scheduler.py contain the same pattern, and fix them in sync if necessary, so the other scheduler implementations do not hit the same issues.
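The first two fixes in the list above can be sketched together; this is an illustrative minimal scheduler, not the real local_scheduler.py:

```python
# Hedged sketch of the fixed _recycle pattern: catch ValueError for
# already-recycled ids, and move the read cursor back only when the removed
# entry sits before it. Class and method names are assumptions.
class MiniScheduler:
    def __init__(self, ids):
        self.ids = list(ids)
        self.ids_read_cursor = len(self.ids)  # all entries already read

    def recycle(self, request_id):
        try:
            idx = self.ids.index(request_id)
        except ValueError:
            return  # id already gone; previously this raised
        self.ids.pop(idx)
        if idx < self.ids_read_cursor:  # previously decremented unconditionally
            self.ids_read_cursor -= 1
```

An unconditional decrement would drift the cursor backwards whenever an entry at or past the cursor is removed, making the scheduler re-read requests it had already dispatched.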
```diff
 elif self.storage_backend_type == "attention_store":
     try:
+        if (self.rank == 0) and self.storage_backend_type == "attention_store":
```
❓ Question The `self.storage_backend_type == "attention_store"` check in this `if` is redundant: the code is already inside the outer `elif self.storage_backend_type == "attention_store":` branch, so the condition is necessarily true.
Suggested simplification:

```python
if self.rank == 0:
    self.storage_backend.flush_token_index(task_id, token_ids, 0, False)
```
Motivation
Modifications
Usage or Command
Accuracy Tests
Checklist